Lab 01 Tutorial

BSTA 6100 Fall 2025 Lab 01

Nicholas J. Seewald, PhD

2025-09-16

CSV files: a common way to store data

CSV stands for “comma separated values” and is a commonly used file type for storing data. Open the file “penguins.csv” from the files pane (lower right) to see what a .csv file looks like:

CSV file structure

The first row contains the names of the variables in the file. Each row after that is an “observation” or “case” and holds the values of one or more variables, separated by commas (hey, look at that).

"species","island","bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g","sex","year"
"Adelie","Torgersen",39.1,18.7,181,3750,"male",2007
"Adelie","Torgersen",39.5,17.4,186,3800,"female",2007

Palmer Penguins Data

We’re going to start by working with a data set on 333 penguins collected from 3 islands in the Palmer Archipelago in Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network, and the data were prepared by Dr. Allison Horst.

Reading CSV files into R

We can read data into R using a function called read.csv(). The first argument to read.csv() is the name of a .csv file (here, penguins.csv), in quotes. We then store the results of read.csv() as an object called penguins.

penguins <- read.csv("penguins.csv", stringsAsFactors = TRUE)
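
To see what stringsAsFactors = TRUE does without relying on penguins.csv, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back both ways. The file contents and the object names (tmp, as_factor, as_character) are invented for illustration.

```r
# Write a tiny, made-up CSV to a temporary file (not the real penguins data)
tmp <- tempfile(fileext = ".csv")
writeLines(c('"species","body_mass_g"',
             '"Adelie",3750',
             '"Gentoo",5000',
             '"Adelie",3800'),
           tmp)

# Text columns become factors (R's type for categorical variables) ...
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
class(as_factor$species)

# ... or stay as plain character strings (the default since R 4.0.0)
as_character <- read.csv(tmp, stringsAsFactors = FALSE)
class(as_character$species)
```

Reading text columns as factors is what later lets functions like str() report the levels of species, island, and sex.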

The penguins object is called a data.frame.

Using head() to peek at a data.frame

Let’s see what’s in the data. We can peek at the first few (6, specifically) rows of the data using the head() function:

head(penguins)
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen           36.7          19.3               193        3450
5  Adelie Torgersen           39.3          20.6               190        3650
6  Adelie Torgersen           38.9          17.8               181        3625
     sex year
1   male 2007
2 female 2007
3 female 2007
4 female 2007
5   male 2007
6 female 2007

We read that line as “head of penguins”. Remember that penguins is what we named our data set. We can see that penguins contains a number of variables, like species, island, and more.

Variable name      Description
species            Penguin species (Adélie, Chinstrap, and Gentoo)
island             Island in Palmer Archipelago, Antarctica, on which the penguin was observed (Biscoe, Dream, or Torgersen)
bill_length_mm     A number denoting bill length (in millimeters)
bill_depth_mm      A number denoting bill depth (in millimeters)
flipper_length_mm  A whole number denoting flipper length (in millimeters)
body_mass_g        A whole number denoting penguin body mass (in grams)
sex                Penguin sex (female, male)
year               Study year (2007, 2008, 2009)

Using str() to peek at a data.frame

We can also peek at the data using a function called str() (pronounced “stir”, short for “structure”):

str(penguins)
'data.frame':   333 obs. of  8 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
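
A few other base R functions answer narrower questions about a data.frame than str() does. The sketch below uses a small made-up data frame (toy, with invented values) so that it runs on its own; the same calls should work on penguins.

```r
# A small, made-up data frame standing in for penguins
toy <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap", "Gentoo")),
  flipper_length_mm = c(181L, 217L, 196L, 221L),
  bill_length_mm = c(39.1, 46.1, 46.5, 50.0)
)

nrow(toy)           # number of observations (rows)
ncol(toy)           # number of variables (columns)
names(toy)          # variable names
head(toy, 2)        # first 2 rows instead of the default 6
sapply(toy, class)  # class of each variable
```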

Frequency Tables

Let’s start with the species variable. Is this a categorical or quantitative variable? How do you know?

To make a frequency table of a categorical variable, we use the table() function:

table(penguins$species)

   Adelie Chinstrap    Gentoo 
      146        68       119 

So, there are 119 Gentoo penguins in the data.

Proportion Tables

Pass a table to prop.table() to get a table of proportions:

prop.table(table(penguins$species))

   Adelie Chinstrap    Gentoo 
0.4384384 0.2042042 0.3573574 
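
Proportions are just counts divided by the total. A minimal sketch with a made-up vector (pets is hypothetical, not part of the lab data) shows the equivalence and one way to turn proportions into rounded percentages:

```r
# Made-up categorical data
pets <- factor(c("cat", "dog", "dog", "fish", "dog", "cat", "dog", "cat"))

counts <- table(pets)
props <- prop.table(counts)

counts / sum(counts)    # same numbers as prop.table(counts)
sum(props)              # proportions add up to 1
round(100 * props, 1)   # percentages, rounded to one decimal place
```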

The $ Operator

Notice that we passed penguins$species to table(): we had to identify the data.frame that contains the variable species. The dollar sign ($) tells R to look inside the object on the left for the object on the right.

It’s very important that you tell R which data frame the variable you’re interested in is from. Let’s see what happens when we don’t:

species
Error: object 'species' not found
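
The same rule can be seen with a small made-up data frame (toy is hypothetical): the column only exists inside the data frame, so it has to be reached through it. Double square brackets are an equivalent way to pull out one column by name.

```r
# A made-up data frame with two variables
toy <- data.frame(species = c("Adelie", "Gentoo"),
                  year = c(2007L, 2009L))

exists("species")   # FALSE unless you have created an object called species yourself
toy$species         # look inside toy for species
toy[["species"]]    # same column, selected by name
identical(toy$species, toy[["species"]])
```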

Side note 1: If you were ever taught to use the attach() function to load a data.frame into the namespace, don’t do that!

Side note 2: When writing text in Quarto, if you want to use $, you must escape it with \. Write \$500 instead of $500.

Two-Way Frequency Tables

We can also make “two-way” frequency tables (sometimes called “contingency tables”) to summarize counts for two categorical variables:

table(penguins$species, penguins$island)
           
            Biscoe Dream Torgersen
  Adelie        44    55        47
  Chinstrap      0    68         0
  Gentoo       119     0         0

Data is Really Cool (Rows, then Columns), so the first variable you give to table() is in the rows of the table, and the second is in the columns.
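
A minimal sketch with made-up data (the toy data frame and its values are hypothetical) shows the rows-then-columns convention and how swapping the arguments transposes the table. It also uses the margin argument of prop.table(): margin = 1 gives proportions within each row.

```r
# Made-up data: five penguins, two categorical variables
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie", "Gentoo"),
  island  = c("Dream", "Biscoe", "Biscoe", "Dream", "Biscoe")
)

by_species <- table(toy$species, toy$island)  # species in rows, islands in columns
by_island  <- table(toy$island, toy$species)  # islands in rows, species in columns

rownames(by_species)
colnames(by_species)

# Row proportions: each row sums to 1
prop.table(by_species, margin = 1)
```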

Bar Charts in R

Bar graphs / charts / plots can be used to visualize categorical data.

barplot(table(penguins$species),
     xlab = "Species",
     ylab = "Frequency",
     main = "Bar Chart of Number of Penguins of Each Species Observed",
     col = c("darkorange1", "mediumorchid2", "darkcyan"))

  • Note that we pass a table object to barplot() – the function takes “heights” as input.
  • xlab is the x-axis label (in quotes)
  • ylab is the y-axis label (in quotes)
  • main is the main title (in quotes)
  • col is a vector of color names that are applied in order of the entries in the passed table.
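
Because barplot() only needs heights, a named numeric vector works as well as a table. The sketch below types in the species counts from the frequency table shown earlier; the object names heights and mids are arbitrary.

```r
# Heights typed in by hand (counts from the species frequency table)
heights <- c(Adelie = 146, Chinstrap = 68, Gentoo = 119)

# barplot() draws one bar per element and uses the names as bar labels;
# it also returns the x-axis position of the middle of each bar
mids <- barplot(heights,
                xlab = "Species",
                ylab = "Frequency",
                main = "Number of Penguins of Each Species Observed",
                col = c("darkorange1", "mediumorchid2", "darkcyan"))
mids
```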

Numerical Summaries

Let’s start with the flipper_length_mm variable. Is this a categorical or quantitative variable? How do you know?

We can use R to summarize data numerically. We’ll use the summary() function to do that for a given variable. Here, we’ll summarize the flipper_length_mm variable, which is the length of the penguins’ flippers (in millimeters).

summary(penguins$flipper_length_mm)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    172     190     197     201     213     231 

You can always get just the one numerical summary you’re looking for using the function for that specific summary:

min(penguins$flipper_length_mm)
[1] 172
mean(penguins$flipper_length_mm)
[1] 200.967
median(penguins$flipper_length_mm)
[1] 197
max(penguins$flipper_length_mm)
[1] 231
sd(penguins$flipper_length_mm)
[1] 14.01577
IQR(penguins$flipper_length_mm)
[1] 23
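
The quartiles reported by summary() are also available through quantile(), and the IQR is the third quartile minus the first. A small sketch with made-up flipper lengths (the flippers vector is hypothetical):

```r
# Made-up flipper lengths in millimeters
flippers <- c(181, 186, 195, 193, 190, 181, 195, 182)

q <- quantile(flippers, probs = c(0.25, 0.75))
q
unname(q[2] - q[1])   # third quartile minus first quartile
IQR(flippers)         # same value

# Range as a single number: max minus min
max(flippers) - min(flippers)
```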

Boxplots in R

Boxplots visualize the “5-number summary” (min, Q1, median, Q3, max) of a quantitative variable.

boxplot(penguins$flipper_length_mm,
        main = "Boxplot of Penguin Flipper Length",
        ylab = "Flipper Length (mm)")

Histograms in R

Histograms can be used to visualize the distribution of a quantitative variable.

hist(penguins$flipper_length_mm)

Titles Are Important

Notice the unprofessional title and x-axis label: hardly anybody other than you understands your variable naming syntax.

Always provide main, xlab, and ylab arguments as appropriate when making plots, unless you’re doing something fast that you won’t show anyone else.

Histograms in R

Here’s something better:

hist(penguins$flipper_length_mm,
     main = "Histogram of Penguin Flipper Length",
     xlab = "Flipper Length (mm)")

Subsetting

Sometimes we want to only look at a certain section of our data. To do this, we’ll create a subset.

chinstrap <- subset(penguins, species == "Chinstrap")
  • First argument is the data.frame you want to subset
  • Second argument is a logical expression (run ?Comparison in the R console for help)
    • Note the double equals (==)! This is logical equals, which is a comparison operator. = is an assignment operator, like <-.

Logical Expressions

chinstrap <- subset(penguins, species == "Chinstrap")

Logical expressions (e.g., species == "Chinstrap") implicitly create TRUE/FALSE (“Boolean”) objects in R. The statement will be TRUE when an observation of the species variable is exactly “Chinstrap” (case-sensitive) and FALSE otherwise.
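
To make the TRUE/FALSE objects visible, the sketch below evaluates a logical expression on its own, using a small made-up vector (fruit is hypothetical). Because TRUE counts as 1 and FALSE as 0, sum() and mean() give the number and proportion of matches.

```r
# Made-up character vector; note the capitalized "Apple"
fruit <- c("apple", "Apple", "kiwi", "apple", "pear")

is_apple <- fruit == "apple"
is_apple         # TRUE only for exact, case-sensitive matches
sum(is_apple)    # how many matches
mean(is_apple)   # proportion of matches

fruit != "apple"                    # "not equal" flips the result
fruit == "apple" | fruit == "kiwi"  # | means "or"
```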

TRY IT! Fill in the chunk below to create subsets for the other species of penguin.

adelie <- subset(penguins, species == "")
gentoo <- subset(penguins, species == "")

An alternative subset method

R’s data.frames can be indexed like two-dimensional arrays, which have rows and columns. (Remember that arrays are Really Cool, so we always write rows, columns.)

We can select particular rows or columns by putting logical expressions inside square brackets []:

chinstrap2 <- penguins[penguins$species == "Chinstrap", ]

Since penguins is two-dimensional (like all data.frames), the brackets have a slot for rows and a slot for columns, separated by a comma. Leaving the space after the comma blank tells R to select all columns.

all.equal(chinstrap, chinstrap2)
[1] TRUE
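
The same bracket notation also selects columns. In this sketch with a made-up data frame (toy and its values are hypothetical), a condition before the comma picks rows and names after the comma pick columns.

```r
# Made-up data frame
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750L, 5000L, 5400L, 3700L),
  year = c(2007L, 2008L, 2009L, 2009L)
)

toy[toy$body_mass_g > 4000, ]                       # heavier penguins, all columns
toy[ , c("species", "year")]                        # all rows, two columns
toy[toy$year == 2009, c("species", "body_mass_g")]  # rows and columns together

# subset() has a select argument that does the same column selection
subset(toy, year == 2009, select = c(species, body_mass_g))
```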

Subsetting vectors

Every variable in a data.frame is a vector: a 1-dimensional object. To subset it, we need only provide conditions on that single dimension.

Let’s subset body_mass_g by sex.

table(penguins$sex)

female   male 
   165    168 

The sex variable in this data is either female or male (note the lowercase names!).

male_body_mass <- penguins$body_mass_g[penguins$sex == "male"]
female_body_mass <- penguins$body_mass_g[penguins$sex == "female"]
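
Once a vector has been subset, it can be passed to any summary function. A sketch with made-up values (mass and sex are hypothetical stand-ins for the penguins columns):

```r
# Made-up body masses (grams) and matching sexes
mass <- c(3750, 3800, 3250, 4675, 3200, 4400)
sex  <- c("male", "female", "female", "male", "female", "male")

# Keep only the masses whose matching sex entry satisfies the condition
male_mass   <- mass[sex == "male"]
female_mass <- mass[sex == "female"]

length(male_mass)   # number of males
mean(male_mass)
mean(female_mass)
```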

Scatterplots in R

A scatterplot is a way to visualize relationships between two numeric variables. On the x-axis is typically the “explanatory” variable (denoted \(x\)), and on the y-axis is the “response” variable (denoted \(y\)). The data are paired as (x, y), and each pair is plotted as an open circle.

The plot() function, when given two numeric variables, will create a scatterplot. The first argument to plot() is on the x axis; the second, on the y axis.

plot(penguins$bill_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Penguin Bill Length versus Body Mass",
     xlab = "Bill Length (mm)",
     ylab = "Body Mass (g)")

Describing Scatterplots

When describing a bivariable relationship in a scatterplot, focus on:

  • Shape: is the relationship linear? non-linear?
  • Sign: is the general slope positive or negative?
  • Strength: how clear is the shape?
  • Unusual points: are there any points that noticeably deviate from the overall shape?
  • Clustering: Do the points group together in noticeable ways?

Adding Color to Scatterplots

Notice that there might be some clustering happening. Let’s color the plot by species to see if that might explain what we’re seeing.

plot(penguins$flipper_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Body Mass vs. Flipper Length",
     xlab = "Flipper Length (mm)",
     ylab = "Body Mass (g)",
     col = c("darkorange1", "mediumorchid2", "darkcyan")[penguins$species])

legend("topleft",
       legend = c("Adelie", "Chinstrap", "Gentoo"),
       col = c("darkorange1", "mediumorchid2", "darkcyan"),
       pch = 1)

NOTE: The information in the legend is not tied to the plot by default. You can make a nonsense legend if you want (you don’t want this). Make sure your legend matches your plot!
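
The indexing trick in col = c(...)[penguins$species] works because a factor is stored as integer codes (1 for the first level, 2 for the second, and so on), and square brackets use those codes as positions. A sketch with a made-up factor (species here is a hypothetical stand-in for penguins$species, and point_colors is an arbitrary name):

```r
# Made-up factor with the same three levels as the penguins data
species <- factor(c("Gentoo", "Adelie", "Adelie", "Chinstrap", "Gentoo"))
point_colors <- c("darkorange1", "mediumorchid2", "darkcyan")

levels(species)        # alphabetical order: Adelie, Chinstrap, Gentoo
as.integer(species)    # the codes behind the labels

# One color per observation, matched to that observation's species
point_colors[species]
```

This is also why the legend lists the species in the same order as levels(): the first color belongs to the first level. The trick relies on the column being a factor, which is what stringsAsFactors = TRUE provided when the data were read in.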

Changing Plotting Characters

Use the pch (plotting character) argument to plot(). Set pch to the number corresponding to the point you want. The default is 1, an open circle.

Let’s change the pch argument so that each species has a different color and plotting character.

plot(penguins$flipper_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Body Mass vs. Flipper Length",
     xlab = "Flipper Length (mm)",
     ylab = "Body Mass (g)",
     col = c("darkorange1", "mediumorchid2", "darkcyan")[penguins$species],
     pch = c(0, 1, 2)[penguins$species])

legend("topleft",
       legend = c("Adelie", "Chinstrap", "Gentoo"),
       col = c("darkorange1", "mediumorchid2", "darkcyan"),
       pch = c(0, 1, 2))

Use Color Meaningfully and with Restraint

The primary function of a graphical display is to convey information. Everything that goes on your plot needs to have a purpose and must convey information.

Use color only to convey information, and don’t rely on it too much.

  • Color should only be used to convey differences of meaning in the data.
  • Many people are colorblind! Avoid red-green and blue-yellow combinations.
  • Have a fallback option to ensure clarity, like different plotting characters.
  • Use “HCL” (hue, chroma, luminance) color scales to choose colors that vary widely.
  • Use ColorBrewer to select palettes https://colorbrewer2.org or the khroma package (“Paul Tol colors”).
  • Assume default colors are chosen poorly (looking at you, ggplot2).

More tips: https://nbisweden.github.io/Rcourse/files/rules_for_using_color.pdf

Typos can be bad!

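If you forget the selector ([penguins$species]) on the col or pch arguments, bad things happen! R recycles the three colors and plotting characters down the rows of the data, so they no longer correspond to species.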

plot(penguins$flipper_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Body Mass vs. Flipper Length",
     xlab = "Flipper Length (mm)",
     ylab = "Body Mass (g)",
     col = c("darkorange1", "mediumorchid2", "darkcyan"),
     pch = c(0, 1, 2))

The ~ Operator

In R, we can use ~ (tilde, found underneath the Esc key in the top left corner of a U.S. English keyboard) as an operator that can be read as “by” (or “versus”). This operator is useful for making several of the plots we have already discussed.

Let’s make side-by-side boxplots of the numeric variable body_mass_g by species.

boxplot(penguins$body_mass_g ~ penguins$species,
        main = "Side-by-Side Boxplots of Body Mass by Penguin Species",
        xlab = "Species",
        ylab = "Body Mass in Grams")

We could also look at only two species by passing multiple arguments:

boxplot(penguins$body_mass_g[penguins$species == "Adelie"],
        penguins$body_mass_g[penguins$species == "Chinstrap"],
        names = c("Adelie", "Chinstrap"),
        main = "Side-by-Side Boxplots of Body Mass by Penguin Species",
        xlab = "Species",
        ylab = "Body Mass in Grams")

Let’s go back to the scatterplot we made last week and update it to use the ~ operator. We can also pass the name of the data set to plot() through the data argument, which lets us skip the $.

plot(body_mass_g ~ bill_length_mm,
     data = penguins,
     main = "Scatterplot of Penguin Body Mass versus Bill Length",
     xlab = "Bill Length (mm)",
     ylab = "Body Mass (g)")

Notice the order here: the y variable (body_mass_g) is written first, then the tilde, then the x variable (bill_length_mm). This is because for scatterplots, the order is y by x or y ~ x. Be very careful setting up scatterplots!
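
The data argument also works with boxplot(), so side-by-side boxplots can be written without $. A sketch with a made-up data frame (toy is hypothetical); plot = FALSE is used at the end only to inspect what boxplot() computes without drawing anything.

```r
# Made-up data: three penguins from each of two species
toy <- data.frame(
  species = factor(c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo")),
  body_mass_g = c(3750, 3800, 3250, 5000, 5400, 4900)
)

# y ~ group: body mass by species
boxplot(body_mass_g ~ species, data = toy,
        main = "Body Mass by Species (made-up data)",
        xlab = "Species",
        ylab = "Body Mass (g)")

# The formula is an ordinary R object that can be stored and reused
f <- body_mass_g ~ species
class(f)

# Group names and the five numbers behind each box, without drawing
box_info <- boxplot(f, data = toy, plot = FALSE)
box_info$names
box_info$stats
```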

Miscellaneous Advanced Topics

The Pipe

R has a native “pipe” operator |> that passes the result of the left-hand-side expression to the right-hand-side expression as the first argument in the call.

x |> f(y) is interpreted as f(x, y)

penguins$species |> table() |> prop.table()

   Adelie Chinstrap    Gentoo 
0.4384384 0.2042042 0.3573574 
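
A small sketch of the rewriting rule with made-up numbers (x is hypothetical). The native pipe needs R 4.1.0 or later.

```r
x <- c(1, 4, 9, 16)

# Nested call: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: read left to right. x becomes the first argument of sqrt(),
# and the 1 is passed along as the second argument of round()
piped <- x |> sqrt() |> mean() |> round(1)

nested
piped
```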

Arranging plots

Base R graphics rely on “graphical parameters” that are set either inside or outside calls to plotting functions. (See ?par for full details.)

To put two plots side by side, we can set the mfrow graphical parameter (mf might stand for “matrix figure”) before calling plot().

par(mfrow = c(1, 2)) # 1 row, 2 columns

plot(body_mass_g ~ flipper_length_mm, data = penguins,
     main = "Body Mass vs. Flipper Length",
     xlab = "Flipper length (mm)",
     ylab = "Body mass (g)")

plot(body_mass_g ~ bill_length_mm, data = penguins,
     main = "Body Mass vs. Bill Length",
     xlab = "Bill length (mm)",
     ylab = "Body mass (g)")

NOTE: When outside of an RMarkdown or Quarto document, you’ll sometimes need to reset the graphical parameters. Do this by calling dev.off() or by calling par() with the original parameters.
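
One way to restore the original parameters is sketched below: par() returns the previous values of whatever it changes, so they can be saved and handed back later. The name old_par is arbitrary and the plotted numbers are made up.

```r
# Set 1 row, 2 columns, and keep the previous setting
old_par <- par(mfrow = c(1, 2))

plot(1:10, (1:10)^2,
     main = "Left panel", xlab = "x", ylab = "x squared")
hist(c(2, 3, 3, 4, 4, 4, 5, 5, 6),
     main = "Right panel", xlab = "Value")

# Hand the saved setting back to par() to undo the change
par(old_par)
par("mfrow")
```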

Line breaks in figure titles

If you have a long figure title, you can break it onto multiple lines with \n:

par(mfrow = c(1, 2))

plot(body_mass_g ~ flipper_length_mm, data = penguins,
     main = "This is a very very long figure title that gets cut off because it's too long",
     xlab = "Flipper length (mm)",
     ylab = "Body mass (g)")

plot(body_mass_g ~ flipper_length_mm, data = penguins,
     main = "This is a very very long figure title\nthat gets cut off because it's too long",
     xlab = "Flipper length (mm)",
     ylab = "Body mass (g)")
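
Instead of placing \n by hand, the breaks can be computed. This is a sketch of one possible approach, not part of the lab's own code: base R's strwrap() splits a string into lines shorter than a target width, and paste() joins them with \n. The width of 40 and the names long_title, title_lines, and wrapped_title are arbitrary choices.

```r
long_title <- "This is a very very long figure title that gets cut off because it's too long"

# Split into pieces of fewer than 40 characters, then join with line breaks
title_lines <- strwrap(long_title, width = 40)
wrapped_title <- paste(title_lines, collapse = "\n")

plot(1:10, (1:10)^2,
     main = wrapped_title,
     xlab = "x",
     ylab = "x squared")
```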

tapply()

tapply() applies a function to a vector. The function can be a predefined R function like mean() or a user-defined function.

The power of tapply() is that it allows for a vector to be split into groups, with the function applied to each group.

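The function has the generic structure below, where y is the vector to summarize, x is the grouping variable, and FUN is the function to apply to each group: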

tapply(y, x, FUN)

Example: tapply()

# calculate mean flipper length for each penguin species
tapply(penguins$flipper_length_mm, penguins$species, mean)
   Adelie Chinstrap    Gentoo 
 190.1027  195.8235  217.2353 
# summarize flipper length for each penguin species
tapply(penguins$flipper_length_mm, penguins$species, summary)
$Adelie
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  172.0   186.0   190.0   190.1   195.0   210.0 

$Chinstrap
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  178.0   191.0   196.0   195.8   201.0   212.0 

$Gentoo
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  203.0   212.0   216.0   217.2   221.5   231.0
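
tapply() also accepts a user-defined function. A sketch with made-up data (flipper, species, and the helper value_range are hypothetical names):

```r
# Made-up flipper lengths and matching species
flipper <- c(181, 186, 195, 217, 221, 230)
species <- factor(c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"))

# A user-defined function: distance between the largest and smallest value
value_range <- function(v) max(v) - min(v)

ranges <- tapply(flipper, species, value_range)
ranges

# Functions can also be written inline
tapply(flipper, species, function(v) mean(v) > 200)
```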